Speech-to-Text Conversion

Techniques, Tools, and Trade-offs

Omkar Ninav
June 30, 2025

What is Speech-to-Text?

  • Converts spoken language into written text
  • Also known as Automatic Speech Recognition (ASR)
  • Used in applications like:
    • Virtual assistants (e.g., Siri, Alexa)
    • Captioning & subtitling
    • Meeting transcription
    • Voice commands

How Speech-to-Text Works

  1. Audio Input
    Voice signal captured from a microphone or file

  2. Feature Extraction
    Converts raw audio into numerical features (e.g., MFCCs, spectrograms)

  3. Acoustic Model
    Maps features to phonemes or characters using ML/DL models

  4. Language Model
    Scores candidate word sequences so the transcription fits the surrounding context

  5. Decoder
    Combines acoustic-model and language-model scores to search for the most likely text

flowchart LR
  A[Audio Input<br/>Voice signal] --> B[Feature Extraction<br/>MFCCs / Spectrograms]
  B --> C[Acoustic Model<br/>ML/DL-based]
  C --> D[Language Model<br/>Contextual Prediction]
  D --> E[Decoder<br/>Final Text Output]
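
The feature-extraction step can be illustrated with a minimal sketch that uses only the Python standard library. It frames a signal, applies a Hann window, and takes the log-magnitude of a naive DFT, which yields a small log-spectrogram. The function names, frame length, hop size, and the synthetic 1 kHz tone are illustrative assumptions, not part of the original slides. A real MFCC front end would add a mel filterbank and a DCT on top of this, and would normally use an FFT from a numerical library.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len, stepping by hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def log_spectrum(frame):
    """Log-magnitudes for DFT bins 0..N/2 of one Hann-windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        # Naive O(N^2) DFT: fine for a demo, too slow for real audio.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(math.log(abs(coeff) + 1e-10))
    return spectrum


def spectrogram(samples, frame_len=64, hop=32):
    """One row of log-spectrum features per frame."""
    return [log_spectrum(f) for f in frame_signal(samples, frame_len, hop)]


# Synthetic input (assumption): a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
features = spectrogram(tone)
peak_bin = max(range(len(features[0])), key=lambda k: features[0][k])
```

With a 64-sample frame at 8 kHz each bin is 125 Hz wide, so the tone's energy concentrates in bin 8. The acoustic model would consume rows like these instead of raw samples.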

Techniques: Traditional ASR

  • HMMs + GMMs
    Early acoustic models: HMMs capture how sounds unfold over time, GMMs model the acoustic features of each state

  • n-gram Language Models
    Predict the next word from the probabilities of the preceding n-1 words

  • Feature Extraction (MFCCs)
    Converts raw audio into compact spectral features (mel-frequency cepstral coefficients)
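
To make the n-gram idea concrete, here is a minimal bigram (n = 2) language model sketch. The toy corpus, the function names, and the add-alpha smoothing are illustrative assumptions. Production systems train on far larger corpora and typically use more refined smoothing.

```python
import math
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count bigrams over whitespace-tokenized sentences with boundary markers."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts


def bigram_prob(counts, prev, word, vocab_size, alpha=1.0):
    """P(word | prev) with add-alpha smoothing so unseen pairs are not zero."""
    total = sum(counts[prev].values())
    return (counts[prev][word] + alpha) / (total + alpha * vocab_size)


def sentence_logprob(counts, sentence, vocab_size):
    """Sum of log P(word | previous word) over the whole sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(math.log(bigram_prob(counts, prev, word, vocab_size))
               for prev, word in zip(tokens, tokens[1:]))


# Toy corpus (assumption, for illustration only).
corpus = ["recognize speech", "recognize speech today", "wreck a nice beach"]
model = train_bigram(corpus)
# Predictable tokens: every corpus word plus the end marker.
vocab_size = len({w for s in corpus for w in s.split()} | {"</s>"})

likely = sentence_logprob(model, "recognize speech", vocab_size)
unlikely = sentence_logprob(model, "recognize beach", vocab_size)
```

Here `likely` is greater than `unlikely`, because "speech" followed "recognize" in the corpus and "beach" never did. A decoder uses this kind of score to choose between acoustically similar hypotheses.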

Python Libraries for Speech-to-Text (1/2)

  • Whisper (OpenAI)
    • Transformer-based, multilingual
    • High accuracy, runs offline
  • SpeechRecognition
    • Simple API wrapper for Google, IBM, etc.
    • Easy for beginners
  • Wav2Vec 2.0 (Meta AI, available via Hugging Face Transformers)
    • Pretrained self-supervised models
    • High-quality transcriptions
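
As a rough illustration of how little code the first two libraries need, the sketch below wraps each in a helper function. It assumes the `openai-whisper` package (plus ffmpeg on the PATH) and the `SpeechRecognition` package are installed. The helper names and the default `"base"` model size are illustrative choices, and the audio path is supplied by the caller.

```python
def transcribe_with_whisper(audio_path, model_name="base"):
    """Transcribe a local audio file offline using openai-whisper."""
    import whisper  # pip install openai-whisper (requires ffmpeg)

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(audio_path)
    return result["text"].strip()


def transcribe_with_speech_recognition(audio_path):
    """Transcribe a WAV/AIFF/FLAC file via the SpeechRecognition wrapper.

    recognize_google sends the audio to Google's web speech service,
    so this path needs network access.
    """
    import speech_recognition as sr  # pip install SpeechRecognition

    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)  # read the whole file
    return recognizer.recognize_google(audio)
```

Calling `transcribe_with_whisper("meeting.wav")` (a hypothetical file name) would return the transcript as a string. The imports sit inside the functions so that only the library actually used needs to be installed.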

Python Libraries for Speech-to-Text (2/2)

  • DeepSpeech (Mozilla)
    • Lightweight and fast
    • Less accurate on noisy inputs
    • No longer actively developed by Mozilla
  • Kaldi (via PyKaldi)
    • Research-grade toolkit
    • Steeper learning curve

Model Comparison: Features at a Glance

Model Accuracy Offline Multilingual Ease of Use Cost
Whisper ✅✅✅ ✅✅✅ ✅✅ Free
Wav2Vec 2.0 ✅✅ ⚠️ (mostly English) Free
Google API ✅✅✅ ✅✅✅ ✅✅✅ Paid
DeepSpeech ✅✅ Free
Kaldi ✅✅ ✅ (with effort) ⚠️ Complex Free

Cost: Open-Source Models

Model          Cost                  Offline Use   Cloud Required
Whisper        Free (Open Source)    Yes           No
Wav2Vec 2.0    Free (Open Source)    Yes           No
DeepSpeech     Free (Open Source)    Yes           No
Kaldi          Free (Open Source)    Yes           No

Cost: Cloud APIs

Service              Approx. Cost       Offline Use   Cloud Required
Google Speech API    ~$1.44 per hour    No            Yes
AWS Transcribe       ~$1.44 per hour    No            Yes
Azure Speech         ~$1.60 per hour    No            Yes
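
To show what these rates mean in practice, the following sketch multiplies the approximate per-hour prices listed above by a monthly audio volume. The 100-hour workload and the function names are illustrative assumptions. Actual pricing varies by tier, region, and features, so treat the output as a rough estimate only.

```python
# Approximate per-hour rates taken from the table above (USD).
RATES_PER_HOUR = {
    "Google Speech API": 1.44,
    "AWS Transcribe": 1.44,
    "Azure Speech": 1.60,
}


def estimate_cost(hours, rate_per_hour):
    """Cost in USD for a number of audio hours at a flat hourly rate."""
    return round(hours * rate_per_hour, 2)


def monthly_costs(hours_per_month):
    """Estimated monthly cost per cloud service for the given audio volume."""
    return {service: estimate_cost(hours_per_month, rate)
            for service, rate in RATES_PER_HOUR.items()}


# Hypothetical workload: 100 hours of audio per month.
for service, cost in monthly_costs(100).items():
    print(f"{service}: ${cost:.2f} per month")
```

An open-source model has no per-hour fee, so the comparison point on that side is the hardware and setup effort rather than a usage bill.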

Cost – Summary & Recommendation

  • Open-Source Models (Whisper, Wav2Vec, etc.)
    • ✅ Free and offline
    • ⚠️ Require setup and local resources
  • Cloud APIs (Google, AWS, Azure)
    • ✅ Easy to use, scalable
    • ⚠️ Ongoing cost and privacy trade-offs

✔️ Recommendation:
Use Whisper or Wav2Vec 2.0 for local, cost-effective transcription
Use Cloud APIs only for real-time or highly multilingual needs

Conclusion & Summary

  • Speech-to-text (STT) is a mature, versatile technology
  • Open-source tools (Whisper, Wav2Vec 2.0) offer high quality at no licensing cost
  • Cloud APIs provide convenience but incur recurring costs
  • Model choice depends on:
    ✅ Accuracy
    ✅ Cost
    ✅ Deployment constraints
    ✅ Privacy needs

✔️ Recommended:
Use Whisper for secure, offline transcription
Use cloud APIs only where real-time processing and scalability are critical

Thank You!

Questions? Suggestions?
